Abstract: The analysis of breast cancer incidence in women, and
of the relationship between ethnicity and survival rate, is an
ongoing area of study in which the secondary data frequently
contain missing values. In this paper, we study and report breast
cancer survival rates by ethnicity, age and income group, using a
dataset collected for 53593 patients in South East England between
1998 and 2003. In addition, we predict the missing values for the
ethnic groups in the dataset. The principal findings of our study
suggest that: 1) women of white ethnicity in South East England
have a higher survival rate than women of black ethnicity, 2) high
income groups have higher survival rates than lower income groups,
and 3) the age groups 80-95 and 20-35 have lower survival rates.


I. Introduction 
A. Motivation for imputation 

Surveys often contain missing values, and the process of
replacing these missing values with substituted values is called
imputation. The records with missing values may still carry
information that is useful for predicting trends and statistics.
For convenience of analysis, statisticians normally discard
observations containing missing values. This reduces the sample
size available for the analysis, and interesting information might
be lost. It can be a serious hindrance, not only by producing
misleading results but also by producing overly simplified
conclusions. Hence, there is a need for imputation in data
analysis, especially in the health care domain.


From the literature, the missing value problem can be
categorised into two main types: 1) values missing at random
and 2) values missing not at random. Values missing not at
random can again be of two types: 1) discrete values (group
membership) and 2) continuous values. Methods such as 1) mean
substitution and 2) median substitution can be used to compute
values missing at random. Other methods, including maximum votes
and nearest neighbour techniques, can also be employed for this
purpose. Regression based substitution can be employed for
continuous values that are missing not at random. The
aforementioned methods fail when imputing values for group
membership, which calls for dedicated methods. When dealing with
missing values for group membership, methods such as 1) analysis
of variance (ANOVA), 2) analysis of mean and 3) classification
based methods can be employed. In the first two methods, the
variance and mean of each group are calculated, and the mean
square error is computed between the observations of each group
and the new group that contains missing values. The group with
the minimum mean square error is assigned to the missing value.
In the classification based approach, the missing values are
treated as unknown class labels and a classifier model is built
on the existing observations. The missing values are then
computed using the classifier model. In this process, the
performance of the classifier must be cross validated.
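
The simple substitution methods mentioned above can be sketched as follows. This is a minimal illustration, not the procedure used in this study; the function names and the use of `None` to mark a missing value are assumptions made for the example.

```python
from collections import Counter
from statistics import mean, median


def impute_numeric(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


def impute_majority(labels):
    """Replace None entries with the most frequent observed label (maximum votes)."""
    observed = [v for v in labels if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in labels]
```

Both functions assign the same value to every missing entry, which is why they are poorly suited to group membership when the groups differ systematically.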

Traditional statistical methods sometimes give overly simplified
solutions; hence there is a need for statistical machine learning
algorithms, which combine probability and statistics. Therefore,
in this study we use a classification based method to impute the
missing values for group membership, i.e. the ethnicity group.


B. Motivation underlying the current study 

Breast cancer is a malignant tumour that originates from
the cells of the breast and grows into surrounding and distant
tissues. It is the second most common cancer in women, which
makes this study important. The relationship between ethnicity,
survival percentage and breast cancer is complex. Past studies
have shown that women of different ethnicities have different
rates of survival from breast cancer after diagnosis. Many
comparisons and links have been made between ethnicity, income
and survival rates in women diagnosed with breast cancer.
According to a study carried out by Cancer Research UK, which
grouped women aged between 15 and 64 years by ethnicity, the
percentage of survival from breast cancer is higher for women of
white ethnicity (91.4%) than for women of black ethnicity (85.0%).
The National Cancer Intelligence Network has produced a report on
'Cancer Incidence and Survival by Major Ethnic Groups in England
between 2002 and 2006'. This report categorises women with breast
cancer into four major ethnicity groups, namely White, Asian,
Black and Unknown, and shows that white women had a higher rate
of survival than those of black ethnicity. Bradley et al. showed
that low socioeconomic status was associated with late-stage
breast cancer at diagnosis and, in most cases, with death.


Cancer Research UK: http://www.cancerresearchuk.org/
National Cancer Intelligence Network: http://www.ncin.org.uk/





C. Objectives and contribution 

The main objectives of this study are two-fold. On the one
hand, from the computational perspective, we examine the
feasibility of using a machine learning based classifier (e.g.
Naive Bayes) to fill in the missing values in the data. On the
other hand, from the scientific perspective, we would like to
understand whether the insights from breast cancer analytics
correspond to what clinicians would expect. For example, we aim
to answer the following questions, which are exploratory in
nature.

• Is the survival rate affected by the age and ethnic
group of the patient? For instance, are black and older
women more likely to die if they have breast cancer?

• Does the financial status of the patient have any effect
on the survival rate? That is, do wealthier patients have a
lower probability of dying from breast cancer?

Our contribution in this paper is to show how a machine
learning based classifier can be utilised to impute the missing
values in health care data and obtain insights. With many
classification methods available in the literature, it is
difficult to choose which one to use. In such a case, the
simplicity and reputation of a method, and experience with its
usage, can influence the selection. Therefore, in this study, we
have chosen the Naive Bayes classifier to compute the missing
values because of its simplicity and low computational cost.


D. Organisation 

The paper is organised as follows. In Section II we
present and analyse the breast cancer dataset. In Section III we
discuss our methodology. Results are discussed in Section IV.
Finally, in Section V we draw conclusions and summarise with
a discussion.


II. Data preprocessing 

The dataset covers 53593 breast cancer incidences in women
recorded in South East England between 1998 and 2003. The initial
dataset consists of 13 features; however, some of these features
are simply an alternative way of representing existing ones, and
are therefore removed from the dataset. Features such as 'Year of
Diagnosis' and 'Year of Death or Censored' are removed because
this information is available within the 'Survival' feature. We
have also removed the single-year 'Age at diagnosis' feature, as
this information is already contained in the 'Age' feature. This
left us with a final dataset of 9 features, shown in Table I
along with their data formats.

TABLE I. Features and their data format.

Feature                   Data format
Income Quintile           1 = (Most Affluent) to 5 = (Most Deprived)
Age at Diagnosis Group    0 = (0-4); 5 = (5-9) to 100 = (100+)
Ethnic Group              Ethnic groups (Table II)
Radiotherapy              0 = No; 1 = Yes
Chemo Therapy             0 = No; 1 = Yes
Hormone Therapy           0 = No; 1 = Yes
Cancer Surgery            0 = No; 1 = Yes
Survival days             Total no. of days
Death of Breast Cancer    0 = No; 1 = Yes


The second stage of our data preprocessing is to convert
the ethnic group data from nominal values to indices (Table II)
and to remove the header labels from the dataset so that it can
be used for further data analysis.

TABLE II. Nominal values of ethnic groups and their
corresponding numerical value.

Ethnicity Group    Nominal Value    Index
White              W                1
Not Known          NK               2
Any Other          Oth              3
Black Caribbean    BC               4
Chinese            C                5
Indian             In               6
Black African      BA               7
Pakistani          P                8
Black Other        BO               9
Asian Other        AO               10
Mixed              M                11
Bangladeshi        Ba               12
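
The conversion from nominal codes to indices in Table II amounts to a lookup table. The following is a small sketch of that step; the function names and the list-of-rows representation are assumptions made for illustration, and only the code-to-index pairs come from Table II.

```python
# Mapping copied from Table II: nominal code -> numerical index.
ETHNIC_INDEX = {
    "W": 1, "NK": 2, "Oth": 3, "BC": 4, "C": 5, "In": 6,
    "BA": 7, "P": 8, "BO": 9, "AO": 10, "M": 11, "Ba": 12,
}


def encode_ethnicity(codes):
    """Convert a list of nominal ethnicity codes to their indices."""
    return [ETHNIC_INDEX[code] for code in codes]


def drop_header(rows):
    """Remove the first (header) row from a list of rows."""
    return rows[1:]
```

An unknown code raises a `KeyError`, which makes unexpected values in the raw data visible rather than silently mapping them to a default.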


A. Demographics 

The age group distribution in the dataset, shown in Figure 1,
is approximately normal, as expected, and indicates that the
highest frequency of patients is between 50 and 65 years of age.

Fig. 1. Distribution of age within the patients.

The most common cancer treatment taken by women of
white ethnicity is radiotherapy, as seen in Figure 2. However,
this seems to be the least common treatment for women of
black African ethnicity.

Figure 3 shows that the women from the white ethnic group
are quite evenly distributed in terms of their socioeconomic
deprivation.

The average number of survival days across the ethnic
groups shows that women of Chinese ethnicity have the longest
survival from breast cancer (see Figure 4). It also highlights
that, although the proportion of women of white ethnicity is
significantly higher than that of any other group, an average
gives a better indication of the cancer survival rate across the
ethnicities. Figure 7 shows that the proportion of breast cancer
survival compared to death is much higher within the white ethnic
group, and similar in the black Caribbean and Indian ethnicities.

Fig. 2. Different treatments for breast cancer.


Fig. 4. Average survival days across the ethnic groups. 
Fig. 3. Distribution of income within the patients.


III. Methodology

A. Naive Bayes for imputation

In order to fill in the ethnicity of the records with unknown
ethnicity group, we decided to take a supervised learning
approach. Supervised learning uses a known set of input data and
the responses to that data to build a predictor model that
generates predictions for the responses to a new set of data.
Since our dataset mainly consists of binary and nominal data
values, we determined that this was a classification problem and
chose the Naive Bayes classification algorithm to model our
predictor. The Naive Bayes algorithm seemed to be the best
choice, despite its comparatively low predictive accuracy,
because it handles categorical predictors very well and its speed
and memory usage are good for simple distributions. More
importantly, this algorithm is easy to interpret because it is
based on finding the posterior probability of the new data
belonging to each class, given that the features are independent
of one another within each class.

The Naive Bayes classifier involves two stages. The first
stage is training, where the probability of each feature's values
given each class, as well as the probability of each class, are
estimated. These are known as the likelihood P(X|C) and the
class prior probability P(C), respectively. The second stage is
prediction, where the posterior probability (Eq. 1) of each class
is calculated given the feature values of the new data. Finally,
the class with the highest posterior probability is predicted as
the result. As the features of our dataset are also assumed to be
independent of one another within each class, we believe that
the Naive Bayes algorithm is well suited to this task.

$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)} \qquad (1)$$
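
The two stages of a Naive Bayes classifier for categorical features can be sketched as below. This is an illustrative from-scratch version, not the Matlab implementation used in this study. The function names are hypothetical, and the add-one (Laplace) smoothing controlled by `alpha` is an assumption added so that unseen feature values do not produce zero probabilities. Log-probabilities are used to avoid numerical underflow, and P(X) is omitted because it is the same for every class.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels, alpha=1.0):
    """Training stage: collect counts for the prior P(C) and likelihood P(X|C)."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, class)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    feature_values = defaultdict(set)
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(j, c)][v] += 1
            feature_values[j].add(v)
    return class_counts, value_counts, feature_values, alpha


def predict_nb(model, row):
    """Prediction stage: return the class with the highest posterior."""
    class_counts, value_counts, feature_values, alpha = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # log P(C)
        for j, v in enumerate(row):
            num = value_counts[(j, c)][v] + alpha
            den = n_c + alpha * len(feature_values[j])
            score += math.log(num / den)  # log P(x_j | C), smoothed
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

For instance, after training on a toy set in which class "a" always has features (1, 0) and class "b" always has (0, 1), `predict_nb` assigns a new record (1, 0) to class "a".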


B. Parameter tuning 

The Naive Bayes classifier supports a number of probability
distribution estimates. In theory, the Multivariate Multinomial
distribution is the ideal choice for us, as our dataset consists
of categorical features. However, we decided to conduct a set of
parameter tuning experiments with the different distribution
options available in Matlab to test this expectation empirically.
We chose 1) Normal (Gaussian), 2) Kernel, 3) Multinomial and
4) Multivariate Multinomial, and set the class prior probability
to uniform in all cases so that the probabilities are equal for
all classes. Subsequently, we chose the best distribution
according to the highest accuracy after cross validation.

C. Cross validation 

In order to run the cross validation, we first extracted the
records of unknown ethnic group from the original dataset and
created the training data from the remaining records. We decided
to use K-fold cross validation in order to obtain a more reliable
estimate of the accuracy, and a value of 10 for K seemed ideal
for such a large dataset. The 10-fold cross validation involved
dividing our training data into 10 sets; setting aside one set
for validation, we used the remaining 9 sets to train the
Bayesian classifier. We then evaluated the classifier on the
validation set and calculated its accuracy using the unbiased
F-measure. This was repeated 10 times, each time with a different
set used for validation, and the average of the F-measure
percentages was calculated. We ran this procedure for each of the
four distributions we considered and chose the distribution with
the highest accuracy.
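
The K-fold procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, `score_fold` stands for any user-supplied function that trains on the training indices and returns a score on the validation indices, and the records are split in their stored order (in practice they would usually be shuffled first).

```python
def k_fold_indices(n, k=10):
    """Yield (train_idx, val_idx) pairs for k folds over n records."""
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


def cross_validate(n, k, score_fold):
    """Average the score returned by score_fold over all k folds."""
    scores = [score_fold(train, val) for train, val in k_fold_indices(n, k)]
    return sum(scores) / len(scores)
```

Each record appears in exactly one validation set, so every record contributes once to the averaged score.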

D. Performance evaluation 

The F-measure is a good way to measure the performance
of a prediction model by checking the predicted results against
the actual results. The process involves counting the True
Positives (TP), True Negatives (TN), False Positives (FP) and
False Negatives (FN) from the comparison of results. Precision
and Recall are then found using the equations in Table III, where
Precision is the ratio of the number of correct results to the
number of all returned results, and Recall is the ratio of the
number of correct results to the number of results that should
have been returned [11]. Finally, the unbiased F-measure is
calculated as the harmonic mean of Precision and Recall.

TABLE III. Performance measurement methods

Method       Formula
Recall       $Re = \frac{TP}{TP + FN}$
Precision    $Pr = \frac{TP}{TP + FP}$
F-measure    $F = 2 \cdot \frac{Pr \times Re}{Pr + Re}$
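
The measures in Table III can be computed directly from paired lists of actual and predicted labels. The sketch below treats one chosen label as the positive class; how the per-class scores were aggregated over the twelve ethnic groups is not stated in the text, so this one-versus-rest formulation and the function name are assumptions.

```python
def f_measure(actual, predicted, positive):
    """Return the F-measure (harmonic mean of precision and recall)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

True negatives do not appear in the calculation, which is why the F-measure is less affected than plain accuracy when one class dominates the data.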


E. Imputation 

Once the dataset was ready to be classified by the Bayesian
model, we assigned the previously extracted records of the
unknown ethnic group as our testing dataset, kept all of the
remaining records for training, and assigned the Ethnic group
feature as the class label. The testing dataset consisted of
18595 records, which is around 35% of the original records. This
gives a ratio of roughly 35:65 between testing and training data,
close to the 30:70 split that we consider appropriate. Once the
predicted results were obtained, we integrated the unknown
records back into the original dataset and replaced the unknown
values with the predicted ethnic groups.
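
The imputation step can be summarised in a few lines. This is a hedged sketch rather than the code used in the study: `fit` and `predict` are placeholders for any classifier's training and prediction functions, and the value 2 for the unknown label follows the index of "Not Known" in Table II.

```python
UNKNOWN = 2  # index assigned to "Not Known" in Table II


def impute_labels(features, labels, fit, predict, unknown=UNKNOWN):
    """Return a copy of labels with unknown entries replaced by predictions."""
    # Train only on the records whose label is known.
    train_x = [x for x, y in zip(features, labels) if y != unknown]
    train_y = [y for y in labels if y != unknown]
    model = fit(train_x, train_y)
    # Keep known labels; predict a label for each unknown record.
    return [predict(model, x) if y == unknown else y
            for x, y in zip(features, labels)]
```

Known labels pass through unchanged, so the returned list has the same length and order as the original records.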


IV. Results

In this section, we present the following results:

• Cross validation results for the considered distributions.

• The effect of ethnicity on the survival rate.

• The impact of age on the survival rate.

• The relationship between financial status and the
survival rate.


A. Cross validation

Table IV indicates that fitting the Bayesian model with a
'Kernel' distribution with uniform prior and with a 'Gaussian'
distribution with no prior gives the most accurate results,
94.10% and 94.01% respectively, when compared to the other
distributions. Therefore, we use the predictions based on the
kernel distribution for the imputation.

TABLE IV. F-measure % according to each distribution.

Distribution type                              F-measure %
Gaussian with no prior                         94.01%
Gaussian with uniform prior                    48.7%
Kernel with uniform prior                      94.10%
Multinomial with uniform prior                 41.98%
Multivariate Multinomial with uniform prior    50.80%


B. Ethnic groups vs survival rates

1) Before imputation: Figure 5 shows the distribution of
ethnicity within our original dataset. According to it, the
majority of our samples are white women (58%). A high
percentage (35%) of the samples are from unknown ethnic
groups. Other ethnic groups exist in small percentages, with
Bangladeshi being the lowest. The high number of white women is
due to socio-demographic reasons: the data was collected in the
South East of England, where most of the population is of white
ethnicity. This is supported by census data from the Office for
National Statistics, UK, according to which 90.7% of the
population of South East England in 2009 was of white ethnicity.

Fig. 5. Distribution of ethnic groups with the majority of participants being
white.


The distribution of the data affects our results because of
the unequal number of samples in the ethnic groups. To avoid
this, we convert the counts to percentages so as to make the
results more reliable.

The results for mortality according to ethnicity show that
the highest number of people who died of breast cancer is in the
white ethnic group (Figure 6). This does not necessarily mean
that this group is more likely to die from breast cancer. To
obtain the likelihood of dying from breast cancer for each
ethnic group, we express our results as percentages within each
group.
Fig. 8. Percentage of predicted data for each ethnicity.


Fig. 6. Distribution of ethnic groups with the majority of participants being 
white. 

Figure 7 shows that white women have a lower mortality rate
than black African, mixed and black other women.

Fig. 7. Death and Survival from breast cancer across ethnic groups.

2) After imputation: Figure 8 shows the predicted values
for the records of unknown ethnicity and indicates that the
majority of them belong to the white ethnic group.

Figure 9 supports the statement that 'for women aged between
15 and 64, the percentage of survival from breast cancer of
those of white ethnicity is likely to be higher than those of
black ethnicity': white women are shown to have an 84% survival
rate, compared to a 77% survival rate for women belonging to
the black ethnic group.

Figure 7 and Figure 10 show the comparison between the
ethnic group distribution before and after prediction,
respectively, where all unknown records are classified into the
existing ethnic groups based on the predicted percentages
obtained for each ethnicity, as shown in Figure 8. Similarly, the
comparison of the counts before and after imputation is shown in
Table V.

Fig. 9. Percentage of survival rate in white is higher than black ethnicity.

C. Age vs survival rates 

Similarly, we produce the results for the relationship
between age and mortality.

Figure 11 shows that patients aged 50 to 60 had the lowest
probability of dying from breast cancer. The highest
probabilities of death are found for ages between 80 and 95.
Another interesting observation is that high probabilities of
death are also found at the younger ages of 20-35. This might be
because younger women are not properly informed or do not visit
their doctors as regularly as older women.



























Fig. 10. Death from Breast Cancer grouped by Ethnicity after prediction.



Fig. 12. Distribution of income within the patients. 

TABLE V. Comparison between the ethnic group distribution
before and after imputation.

                   With missing data     After prediction
Ethnicity          Death    Survival     Death    Survival
White              6072     25037        7184     31578
Not Known          3385     15210        0        0
Any Other          230      1249         986      4358
Black Caribbean    145      507          324      1156
Chinese            14       103          140      1770
Indian             113      526          260      1576
Black African      95       249          274      595
Pakistani          23       98           92       376
Black Other        54       158          147      463
Asian Other        22       148          68       878
Mixed              30       83           470      234
Bangladeshi        9        33           223      345
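
The conversion from counts to within-group percentages can be illustrated with the counts in Table V. The sketch below copies a few "after prediction" rows from the table; the names are hypothetical. These counts cover all ages, so the resulting percentages are not expected to match figures quoted in the text for the 15-64 age range.

```python
# (deaths, survivals) after prediction, copied from Table V.
AFTER_PREDICTION = {
    "White": (7184, 31578),
    "Black Caribbean": (324, 1156),
    "Black African": (274, 595),
    "Black Other": (147, 463),
}


def survival_percentage(deaths, survivals):
    """Percentage of patients in a group who did not die of breast cancer."""
    return 100.0 * survivals / (deaths + survivals)


percentages = {
    group: survival_percentage(d, s)
    for group, (d, s) in AFTER_PREDICTION.items()
}
```

Working with percentages rather than raw counts keeps the much larger white group from dominating the comparison.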


Fig. 11. Mortality rate according to age groups.

D. Income vs survival rate

The final question we are interested in is whether financial
status affects the death rate. The Income feature indicates the
financial status of patients, with values from 1 (richest) to
5 (poorest). In Figure 12 we see that wealthier patients have
lower death rates. This is probably because they can afford
better treatment facilities.


Data collection always has some limitations. In fact, by
examining the breast cancer dataset, we notice a clear imbalance
in the number of participants between the different ethnic
groups: white women represent more than half of the population,
while all other ethnicities have very few records. This
limitation may have biased our prediction results, which may
explain the low F-measure percentages obtained with the
distributions other than kernel and Gaussian.


V. Conclusions and Discussions

After obtaining and appraising our results, we affirm that
the type of dataset to be classified plays a role in selecting
the appropriate distribution type for the Bayesian classifier.
Based on our results, the kernel distribution has the best
F-measure percentage among all the distributions considered.

Comparing our results with previous statistical research,
we can confirm that our findings for the scientific objectives
are consistent with theirs. In fact, our results agree with
these studies that white women have a higher survival percentage
than black women, reported there as 91.4% and 85%, respectively.
Similarly, older women and lower income groups have high
mortality rates.